Supervised learning requires labeled data, which can be expensive to acquire. For example, a dataset with $N$ samples for classification will require manual labeling $N$ times.
One way to ameliorate this issue is to cluster the raw data samples first, then manually inspect and label only a few samples. Recall that clustering is a form of unsupervised learning, so it does not require any class labels.
For example, say we are given a set of scanned hand-written digit images. We can cluster them into 10 groups first, manually inspect and label a few images in each cluster, and propagate each label to the remaining (unlabeled) samples in its cluster.
The accuracy of such semi-automatic labeling depends on the accuracy of the clustering. If each cluster (0 to 9) corresponds exactly to hand-written digits 0-9, we are fine. Otherwise, we have some mis-labeled data.
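To make the idea concrete, here is a minimal sketch of this cluster-then-propagate labeling on the digits data (my own illustration, not the graded solution below): cluster the raw pixels, "manually" label only a handful of samples per cluster, and propagate the majority label to the rest. The ground-truth labels stand in for the manual inspection step.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.cluster import KMeans

digits = load_digits()
X, y = digits.data, digits.target

# Cluster the raw pixel vectors into 10 groups; no labels are used here.
km = KMeans(n_clusters=10, random_state=1).fit(X)
cluster_ids = km.labels_

# "Manually" label only 5 samples per cluster, then propagate the
# majority of those labels to every other sample in the cluster.
y_propagated = np.empty_like(y)
for c in range(10):
    members = np.where(cluster_ids == c)[0]
    inspected = members[:5]  # pretend only these few were hand-labeled
    majority = np.bincount(y[inspected]).argmax()
    y_propagated[members] = majority

print('semi-automatic labeling accuracy:', np.mean(y_propagated == y))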
The goal of this question is to exercise clustering of the scikit-learn digits dataset which has labels, so that we can verify our clustering accuracy. The specifics are as follows.
You will be judged by the test accuracy of your code, and quality of descriptions of your method. As a reference, a simple code I (Li-Yi) wrote can achieve about 78% accuracy. Try to beat it as much as you can.
We will split the original dataset into training and test datasets
What is your clustering accuracy (comparing cluster labels with the ground truth labels), and what are the properties of mis-clustered samples?
Would the original features (pixels) work well, or do we need further processing, such as scaling/standardization or dimensionality reduction, before clustering?
Let's focus on k-means clustering, as hierarchical and density-based clustering do not provide the predict() method under scikit-learn.
What is the best test performance you can achieve, and with which hyper-parameters (for k-means, the standard scaler, and dimensionality reduction)?
We have learned Pipeline and GridSearchCV for cross validation and hyper-parameter tuning.
In [1]:
%load_ext watermark
%watermark -a '' -u -d -v -p numpy,pandas,matplotlib,scipy,sklearn
%matplotlib inline
In [2]:
# Version check to handle module locations that changed in scikit-learn 0.18
from distutils.version import LooseVersion as Version
from sklearn import __version__ as sklearn_version
In [3]:
import numpy as np
from sklearn.datasets import load_digits
digits = load_digits()
X = digits.data # data in pixels
y = digits.target # digit labels
print(X.shape)
print(y.shape)
print(np.unique(y))
In [4]:
import matplotlib.pyplot as plt
import pylab as pl
num_rows = 4
num_cols = 5
fig, ax = plt.subplots(nrows=num_rows, ncols=num_cols, sharex=True, sharey=True)
ax = ax.flatten()
for index in range(num_rows*num_cols):
img = digits.images[index]
label = digits.target[index]
ax[index].imshow(img, cmap='Greys', interpolation='nearest')
ax[index].set_title('digit ' + str(label))
ax[0].set_xticks([])
ax[0].set_yticks([])
plt.tight_layout()
plt.show()
In [5]:
if Version(sklearn_version) < '0.18':
from sklearn.cross_validation import train_test_split
else:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, random_state=1)
num_training = y_train.shape[0]
num_test = y_test.shape[0]
print('training: ' + str(num_training) + ', test: ' + str(num_test))
In [6]:
import numpy as np
# check to see if the data are well distributed among digits
# use a separate loop variable so we do not overwrite the full label array y
for labels in [y_train, y_test]:
    print(np.bincount(labels))
We first write a scoring function for clustering so that we can use it with GridSearchCV. Take a look at make_scorer under scikit-learn.
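For reference, the wiring is roughly as follows (a sketch only; adjusted_rand_score stands in for the custom score we define next, and the real grid search appears later in this notebook):
from sklearn.metrics import make_scorer, adjusted_rand_score

# make_scorer wraps any (y_true, y_pred) -> float function into a scorer
# object that GridSearchCV can maximize during cross validation,
# e.g. GridSearchCV(estimator=pipe, param_grid=..., scoring=scorer).
scorer = make_scorer(adjusted_rand_score)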
In [7]:
## Note: we do not guarantee a one-to-one correspondence between clusters and labels,
## so the toy results below can differ from the naive bijective accuracy.
## See the explanation after the toy examples for more information.
def clustering_accuracy_score(y_true, y_pred):
n_labels = len(list(set(y_true)))
n_clusters = len(list(set(y_pred)))
Pre = np.zeros((n_clusters, n_labels))
Rec = np.zeros((n_clusters, n_labels))
F = np.zeros((n_clusters, n_labels))
w = np.zeros((n_clusters))
F_i = np.zeros((n_clusters))
P = np.zeros((n_labels))
C = np.zeros((n_clusters))
for i in range(n_clusters):
C[i] = sum(y_pred == i)
for j in range(n_labels):
P[j] = sum(y_true == j)
for i in range(n_clusters):
F_i_max = 0
for j in range(n_labels):
if (C[i]):
Pre[i][j] = sum(y_pred[y_true == j] == i) / C[i]
if (P[j]):
Rec[i][j] = sum(y_true[y_pred == i] == j) / P[j]
if (Pre[i][j]+Rec[i][j]):
F[i][j] = 2*Pre[i][j]*Rec[i][j]/(Pre[i][j]+Rec[i][j])
F_i_max = max(F_i_max, F[i][j])
F_i[i] = F_i_max
w[i] = sum(y_pred == i) / len(y_pred)
return F_i.dot(w)
In [8]:
# toy case demonstrating the clustering accuracy
# this is just a reference to illustrate what this score function is trying to achieve
# feel free to design your own as long as you can justify
# ground truth class label for samples
toy_y_true = np.array([0, 0, 0, 1, 1, 2])
# clustering id for samples
toy_y_pred_true = np.array([1, 1, 1, 2, 2, 0])
toy_y_pred_bad1 = np.array([0, 0, 1, 1, 1, 2])
toy_y_pred_bad2 = np.array([2, 2, 1, 0, 0, 0])
toy_accuracy = clustering_accuracy_score(toy_y_true, toy_y_pred_true)
print('accuracy', toy_accuracy, ', should be 1')
toy_accuracy = clustering_accuracy_score(toy_y_true, toy_y_pred_bad1)
print('accuracy', toy_accuracy, ', should be', 5.0/6.0)
toy_accuracy = clustering_accuracy_score(toy_y_true, toy_y_pred_bad2)
print('accuracy', toy_accuracy, ', should be', 4.0/6.0, ', this will be explained in the following content')
I adopt a modified version of F-value selection: for each cluster, select the label class with the highest F-score. This way of computing accuracy also handles the case where the number of clusters differs from the number of labels, which lets GridSearchCV tune the number of clusters.
Formula: Let $C[i]$ denotes cluster $i$ and $P[j]$ denotes label $j$. Then for each $(i, j)$, we have
$$
\begin{align}
\text{Precision}[i][j] & = \frac{\left|C[i]\cap P[j]\right|}{|C[i]|} \\
\text{Recall}[i][j] & = \frac{\left|C[i]\cap P[j]\right|}{|P[j]|} \\
\text{F-value}[i][j] & = \frac{ 2 \times \text{Precision}[i][j] \times \text{Recall}[i][j]}{\text{Precision}[i][j] + \text{Recall}[i][j]}.
\end{align}
$$
Then for each cluster, we search the best F-value for each label, that is, $$\text{F-value}[i] = \max\limits_{j} \text{F-value}[i][j].$$ We also store the weight $w$ for each cluster, that is, $$w[i] = \frac{|C[i]|}{n}$$
Hence the final score is simply the dot product of $\text{F-value}$ and $w$, which is the weighted F-score.
Again, because this scoring method allows the number of clusters to differ from the number of labels, we do not guarantee a bijection between clusters and labels; some labels may end up with no corresponding cluster, as the second toy example shows.
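As a worked check of the second toy example (my own arithmetic, following the definitions above): cluster 0 contains samples {3, 4, 5} with true labels {1, 1, 2}, cluster 1 contains sample {2} with true label 0, and cluster 2 contains samples {0, 1} with true labels {0, 0}, so
$$
\begin{align}
\text{F-value}[0] &= \frac{2 \cdot \frac{2}{3} \cdot 1}{\frac{2}{3} + 1} = 0.8 \;\; (j = 1), & w[0] &= \frac{3}{6}, \\
\text{F-value}[1] &= \frac{2 \cdot 1 \cdot \frac{1}{3}}{1 + \frac{1}{3}} = 0.5 \;\; (j = 0), & w[1] &= \frac{1}{6}, \\
\text{F-value}[2] &= \frac{2 \cdot 1 \cdot \frac{2}{3}}{1 + \frac{2}{3}} = 0.8 \;\; (j = 0), & w[2] &= \frac{2}{6}.
\end{align}
$$
The score is therefore $0.8 \cdot \frac{1}{2} + 0.5 \cdot \frac{1}{6} + 0.8 \cdot \frac{1}{3} = 0.75$ rather than the bijective accuracy of $4/6$: clusters 1 and 2 both map to label 0, and label 2 gets no cluster at all.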
Build a pipeline with standard scaler, PCA, and clustering.
In [9]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import KernelPCA
from sklearn.cluster import KMeans
from sklearn.metrics import make_scorer
from scipy.stats import mode
pipe = Pipeline([('scl', StandardScaler()),
('pca', KernelPCA()),
('km', KMeans(random_state=1))])
# map each cluster id to its most likely actual label (the label with the highest F-value)
def cluster_mapping(y_true, y_pred):
mapping = {}
n_labels = len(list(set(y_true)))
n_clusters = len(list(set(y_pred)))
Pre = np.zeros((n_clusters, n_labels))
Rec = np.zeros((n_clusters, n_labels))
F = np.zeros((n_clusters, n_labels))
P = np.zeros((n_labels))
C = np.zeros((n_clusters))
for i in range(n_clusters):
C[i] = sum(y_pred == i)
for j in range(n_labels):
P[j] = sum(y_true == j)
for i in range(n_clusters):
F_i_max = 0
F_i_max_label = 0
for j in range(n_labels):
if (C[i]):
Pre[i][j] = sum(y_pred[y_true == j] == i) / C[i]
if (P[j]):
Rec[i][j] = sum(y_true[y_pred == i] == j) / P[j]
if (Pre[i][j]+Rec[i][j]):
F[i][j] = 2*Pre[i][j]*Rec[i][j]/(Pre[i][j]+Rec[i][j])
if (F_i_max < F[i][j]):
F_i_max_label = j
F_i_max = F[i][j]
mapping[i] = F_i_max_label
return mapping
Use GridSearchCV to tune hyper-parameters.
In [10]:
if Version(sklearn_version) < '0.18':
from sklearn.grid_search import GridSearchCV
else:
from sklearn.model_selection import GridSearchCV
pcs = list(range(1, 60))
kernels = ['linear', 'rbf', 'cosine']
initTypes = ['random', 'k-means++']
clusters = list(range(10, 20))
tfs = [True, False]
param_grid = [{'pca__n_components': pcs,
'pca__kernel': kernels,
'km__init' : initTypes,
'km__n_clusters' : clusters,
'scl__with_std' : tfs,
'scl__with_mean' : tfs}]
gs = GridSearchCV(estimator=pipe,
param_grid=param_grid,
scoring=make_scorer(clustering_accuracy_score),
cv=10,
n_jobs=-1,
verbose=False)
gs = gs.fit(X_train, y_train)
print(gs.best_score_)
print(gs.best_params_)
In [11]:
best_model = gs.best_estimator_
print('Test accuracy: %.3f' % clustering_accuracy_score(y_test, best_model.predict(X_test)))
Visualize mis-clustered samples, and provide your explanation.
In [12]:
mapping = cluster_mapping(y_train, best_model.predict(X_train))
y_test_pred = np.array(list(map(lambda x: mapping[x], best_model.predict(X_test))))
miscl_img = X_test[y_test != y_test_pred][:25]
correct_lab = y_test[y_test != y_test_pred][:25]
miscl_lab = y_test_pred[y_test != y_test_pred][:25]
fig, ax = plt.subplots(nrows=5, ncols=5, sharex=True, sharey=True)
ax = ax.flatten()
for i in range(25):
img = miscl_img[i].reshape(8, 8)
ax[i].imshow(img, cmap='Greys', interpolation='nearest')
ax[i].set_title('%d) t: %d p: %d' % (i+1, correct_lab[i], miscl_lab[i]))
ax[0].set_xticks([])
ax[0].set_yticks([])
plt.tight_layout()
plt.show()
The accuracy is 84.4%, which means that on average more than 1 out of every 10 digits is clustered incorrectly; the error is still high compared with, say, neural networks or other supervised methods. As the figure above shows, the mis-clustered samples are generally of two kinds, discussed below.
First, digit '5' tends to be mis-clustered as digit '9' (e.g. No. 5, No. 17, and No. 18), and digit '9' tends to be mis-clustered as digit '1' or '7' (e.g. No. 1, No. 2, No. 8, No. 19, No. 23, and No. 24). The likely reason is that these digit clusters are not far apart, so a sample lying near the boundary, for example a blurred or faint digit, is easily assigned to the neighboring cluster.
Second, No. 10, No. 14, and No. 15 show that some '1' samples in the dataset are twisted, tilted, or blurred; it is hard even for a human to tell whether they are a '1'. Such samples tend to be clustered as digit '6' or '2', depending on their shape and level of clarity.
We will use the CIFAR-10 dataset for image object recognition. The dataset consists of 50000 training samples and 10000 test samples in 10 different classes (airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck; see the link above for more information). The goal is to maximize the accuracy of your classifier on the test dataset after being optimized via the training dataset.
You can use any learning models (supervised or unsupervised) or optimization methods (e.g. search methods for hyper-parameters). The only requirement is that your code can run inside an ipynb file, as usual. Please provide a description of your method, in addition to the code. Your answer will be evaluated not only on the test accuracy but also on the creativity of your methodology and the quality of your explanation/description.
I adopted a Wide Residual Network with 28 convolutional layers and 10x width. As stated in Zagoruyko et al.'s paper [1](#myfootnote1), a 96.11% test accuracy can be obtained.
Without cifar-10 data size-augmentation[2](#myfootnote2), one of the best reported test results is 93.57% accuracy[3](#myfootnote3), obtained with the well-known deep residual network (ResNet)[4](#myfootnote4). Since the desired test accuracy is high, I decided to start from ResNet. However, inspired by Zagoruyko et al., I chose a modified version of ResNet: the Wide Residual Network.
Wide ResNet increases the number of channels in each convolutional layer while decreasing the depth of the network. According to their results, a structure with 28 convolutional layers, 10x width, and dropout achieves the best result: 96.11% accuracy.
When training my network, however, I only had the results from the early version of their paper[5](#myfootnote5), in which training without a dropout layer achieves better accuracy (a figure revised from 95.61% to 96.00% in the latest version). Moreover, due to the assignment time constraint, I simply used the default Adadelta optimizer instead of SGD with a tuned learning-rate schedule for the first 100 iterations, which may account for my lower accuracy. After fine-tuning, I reached a 94.14% accuracy, which is still higher than the original ResNet.
Using keras.preprocessing.image.ImageDataGenerator, the data preprocessing scheme I use is:
- featurewise_center: set the feature-wise input mean to 0 over the dataset
- featurewise_std_normalization: divide inputs by the feature-wise std of the dataset
- rotation_range: 10$^{\circ}$ range for random rotations
- width_shift_range: 15.625% random horizontal shifts
- height_shift_range: 15.625% random vertical shifts
- horizontal_flip: randomly flip inputs horizontally

Since vertical_flip looks unrealistic for a real-world picture, I did not use it. Also, zca_whitening may result in loss of image information, and the other options may more or less lead to problems in training.
Moreover, I did not adopt the cifar-10 data size-augmentation[2](#myfootnote2), since it requires structural changes to my network and would increase the time cost of training and fine-tuning. As the paper suggests, the test accuracy should increase further with size-augmented data compared with non-size-augmented data.
Since we use 28 convolutional layers, by the structure of wide ResNet the layer blocks are (as implemented in the code below): an initial 3x3 convolution with 16 channels; N = 4 residual blocks with 16k channels; 2x2 max pooling; N = 4 residual blocks with 32k channels; 2x2 max pooling; and N = 4 residual blocks with 64k channels.
All convolutional layers are followed by a batch normalization layer and a ReLU layer.
After all layer blocks, we use 8-by-8 average pooling, followed by a flatten layer and finally a fully connected layer (with softmax) with 10 outputs, one per class.
As discussed in 1.2, the most important hyper-parameters of wide ResNet, as Zagoruyko et al. describe, are N (the number of layers), K (the width multiplier of the convolutional channels), and the dropout ratio. I adopted the best configuration from version 1 of their work, that is, N = 28, K = 10 (times the original ResNet width), and a dropout ratio of 0. However, as version 2 points out, the model with dropout can achieve a slightly better result. Since the difference is only 0.11%, dropout may not matter much in wide ResNet.
You may view the detailed model in the following code output.
Network | Accuracy |
---|---|
Kaiming He et al. ResNet | 93.57% |
My WRN-28-10 w/o dropout (Adadelta only) | 93.69% |
My WRN-28-10 w/o dropout (Fine-tuned) | 94.14% |
Zagoruyko et al WRN-28-10 v1 w/ dropout | 95.61% |
Zagoruyko et al WRN-28-10 v1 w/o dropout | 95.83% |
Zagoruyko et al WRN-28-10 v2 w/o dropout | 96.00% |
Zagoruyko et al WRN-28-10 v2 w/ dropout | 96.11% |
With the Adadelta optimizer only, my training accuracy eventually reaches 100%, which indicates over-fitting and explains why the test accuracy stops improving. With SGD fine-tuning of the model, the test accuracy increases by around 0.45%. However, my model fails to achieve a test accuracy above 95% as the paper suggests. See 4.2 Limitations for more analysis.
Note that the code provided below simply loads the fine-tuned result of my network, and the training run shown is only for demonstrating the result; it is not the original training process.
1: Zagoruyko, Sergey, and Nikos Komodakis. "Wide Residual Networks." arXiv preprint arXiv:1605.07146v2 (2016).
2: Graham, Benjamin. "Fractional max-pooling." arXiv preprint arXiv:1412.6071 (2015).
3: Benenson, Rodrigo. "What is the class of this image?" rodrigob@github.
4: He, Kaiming, et al. "Deep residual learning for image recognition." arXiv preprint arXiv:1512.03385 (2015).
5: Zagoruyko, Sergey, and Nikos Komodakis. "Wide Residual Networks." arXiv preprint arXiv:1605.07146v1 (2016).
6: Long, Jonathan, Evan Shelhamer, and Trevor Darrell. "Fully convolutional networks for semantic segmentation." IEEE Conference on CVPR. 2015.
7: Dai, Jifeng, et al. "R-FCN: Object Detection via Region-based Fully Convolutional Networks." arXiv preprint arXiv:1605.06409 (2016).
All referenced code is credited in its context.
In [1]:
# Functions to build a user-defined wide ResNet (WRN) for cifar-10
# Author: Somshubra Majumdar https://github.com/titu1994/Wide-Residual-Networks
# Modified By: Gao Chang, HKU
from keras.models import Model
from keras.layers import Input, merge, Activation, Dropout, Flatten, Dense
from keras.layers.convolutional import Convolution2D, MaxPooling2D, AveragePooling2D
from keras.layers.normalization import BatchNormalization
from keras import backend as K
def initial_conv(input):
x = Convolution2D(16, 3, 3, border_mode='same')(input)
channel_axis = 1 if K.image_dim_ordering() == "th" else -1
x = BatchNormalization(axis=channel_axis)(x)
x = Activation('relu')(x)
return x
def conv1_block(input, k=1, dropout=0.0):
init = input
channel_axis = 1 if K.image_dim_ordering() == "th" else -1
# Check if input number of filters is same as 16 * k, else create convolution2d for this input
if K.image_dim_ordering() == "th":
if init._keras_shape[1] != 16 * k:
init = Convolution2D(16 * k, 1, 1, activation='linear', border_mode='same')(init)
else:
if init._keras_shape[-1] != 16 * k:
init = Convolution2D(16 * k, 1, 1, activation='linear', border_mode='same')(init)
x = Convolution2D(16 * k, 3, 3, border_mode='same')(input)
x = BatchNormalization(axis=channel_axis)(x)
x = Activation('relu')(x)
if dropout > 0.0: x = Dropout(dropout)(x)
x = Convolution2D(16 * k, 3, 3, border_mode='same')(x)
x = BatchNormalization(axis=channel_axis)(x)
x = Activation('relu')(x)
m = merge([init, x], mode='sum')
return m
def conv2_block(input, k=1, dropout=0.0):
init = input
channel_axis = 1 if K.image_dim_ordering() == "th" else -1
# Check if input number of filters is same as 32 * k, else create convolution2d for this input
if K.image_dim_ordering() == "th":
if init._keras_shape[1] != 32 * k:
init = Convolution2D(32 * k, 1, 1, activation='linear', border_mode='same')(init)
else:
if init._keras_shape[-1] != 32 * k:
init = Convolution2D(32 * k, 1, 1, activation='linear', border_mode='same')(init)
x = Convolution2D(32 * k, 3, 3, border_mode='same')(input)
x = BatchNormalization(axis=channel_axis)(x)
x = Activation('relu')(x)
if dropout > 0.0: x = Dropout(dropout)(x)
x = Convolution2D(32 * k, 3, 3, border_mode='same')(x)
x = BatchNormalization(axis=channel_axis)(x)
x = Activation('relu')(x)
m = merge([init, x], mode='sum')
return m
def conv3_block(input, k=1, dropout=0.0):
init = input
channel_axis = 1 if K.image_dim_ordering() == "th" else -1
# Check if input number of filters is same as 64 * k, else create convolution2d for this input
if K.image_dim_ordering() == "th":
if init._keras_shape[1] != 64 * k:
init = Convolution2D(64 * k, 1, 1, activation='linear', border_mode='same')(init)
else:
if init._keras_shape[-1] != 64 * k:
init = Convolution2D(64 * k, 1, 1, activation='linear', border_mode='same')(init)
x = Convolution2D(64 * k, 3, 3, border_mode='same')(input)
x = BatchNormalization(axis=channel_axis)(x)
x = Activation('relu')(x)
if dropout > 0.0: x = Dropout(dropout)(x)
x = Convolution2D(64 * k, 3, 3, border_mode='same')(x)
x = BatchNormalization(axis=channel_axis)(x)
x = Activation('relu')(x)
m = merge([init, x], mode='sum')
return m
def WRN(nb_classes, N, k, dropout):
"""
Creates a Wide Residual Network with specified parameters
:param nb_classes: Number of output classes
:param N: Depth of the network. Compute N = (n - 4) / 6.
Example : For a depth of 16, n = 16, N = (16 - 4) / 6 = 2
Example2: For a depth of 28, n = 28, N = (28 - 4) / 6 = 4
Example3: For a depth of 40, n = 40, N = (40 - 4) / 6 = 6
:param k: Width of the network.
:param dropout: Adds dropout if value is greater than 0.0
"""
init = Input(shape=(3, 32, 32))
x = initial_conv(init)
for i in range(N):
x = conv1_block(x, k, dropout)
x = MaxPooling2D((2,2))(x)
for i in range(N):
x = conv2_block(x, k, dropout)
x = MaxPooling2D((2,2))(x)
for i in range(N):
x = conv3_block(x, k, dropout)
x = AveragePooling2D((8,8))(x)
x = Flatten()(x)
x = Dense(nb_classes, activation='softmax')(x)
model = Model(init, x)
return model
In [2]:
import numpy as np
import sklearn.metrics as metrics
from keras.datasets import cifar10
from keras.models import Model
from keras.layers import Input
from keras.optimizers import SGD
import keras.callbacks as callbacks
import keras.utils.np_utils as kutils
from keras.preprocessing.image import ImageDataGenerator
from sklearn.metrics import accuracy_score
batch_size = 64
nb_epoch = 5
(X_train, y_train), (X_test, y_test) = cifar10.load_data()
X_train = X_train.astype('float32')
X_train /= 255.0
X_test = X_test.astype('float32')
X_test /= 255.0
y_train = kutils.to_categorical(y_train)
y_test = kutils.to_categorical(y_test)
generator = ImageDataGenerator(featurewise_center=True,
featurewise_std_normalization=True,
rotation_range=10,
width_shift_range=5./32,
height_shift_range=5./32,
horizontal_flip=True)
generator.fit(X_train, seed=0, augment=True)
test_generator = ImageDataGenerator(featurewise_center=True,
featurewise_std_normalization=True)
test_generator.fit(X_test, seed=0, augment=True)
model = WRN(nb_classes=10, N=4, k=10, dropout=0.0)
model.summary()
In [3]:
print ("Start Training:")
sgd = SGD(lr = 0.001, decay = 0.1, momentum = 0.9, nesterov = True)
model.compile(loss="categorical_crossentropy", optimizer=sgd, metrics=["acc"])
# model.load_weights("WRN-28-10.h5")
model.fit_generator(generator.flow(X_train, y_train, batch_size=batch_size),
samples_per_epoch=len(X_train),
nb_epoch=nb_epoch,
# callbacks=[callbacks.ModelCheckpoint("WRN-28-10-Best.h5", monitor="val_acc", save_best_only=True)],
validation_data=test_generator.flow(X_test, y_test, batch_size=batch_size),
nb_val_samples=X_test.shape[0],
verbose = True)
In [6]:
print ("Start Testing:")
# model.load_weights("WRN-28-10.h5")
results = model.evaluate_generator(test_generator.flow(X_test, y_test, batch_size=batch_size), X_test.shape[0])
print ("Results:")
print ("Test loss: {0}".format(results[0]))
print ("Test accuracy: {0}".format(results[1]))